STOCK SENTIMENT ANALYSIS based on News Articles by Cathrine Xavier (GH1027014)¶

BUSINESS PROBLEM¶

Rookie investors are often motivated by the dream of making millions in stocks. In reality, however, success takes immense research into market behaviour, business knowledge, thorough reading, long hands-on experience, and funds to deploy in the stock market. The relationship between supply and demand is highly sensitive to the news of the moment. In particular, reports from governments are always news, as they suggest the strength or weakness of the economy.¶

Hence my idea is to train NLP machine learning models on stock sentiment data and news to analyse market sentiment, then identify the best-fitting model to build a system that provides clear insights into market sentiment. These insights can be compared with stock prices, thereby reducing the risk of loss and improving trading strategies. My stakeholders are financial analysts, the investment and portfolio management team, executive leadership, and board members.¶

System architecture :¶

Input - Raw, unquantified stock news and articles¶

Output - Predicted stock sentiment, labelled positive or negative.¶

Components:¶

1- Data preparation and Preprocessing¶

2- Text vectorization¶

3- Model Training¶

4- Model Evaluation¶

5- Deployment¶

EXPECTATIONS :¶

- To signal anticipated market activity before it occurs, based on news headlines and articles.¶

- To form a base strategy combining trading and market fundamentals with historical and technical analysis of a stock's intrinsic value, particularly for specific stocks or sectors.¶

- To build a solid understanding of finance and economic indicators driven by positive or negative news, since professional traders anticipate the news cycle when there are no hard facts from the company or when reports are lagging.¶

My dataset is from Kaggle and has three columns:¶

- Unnamed: 0 (serial number)¶

- Sentiment¶

- Sentence¶

Importing required libraries¶

In [3]:
import re
import string

import contractions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from scipy.interpolate import make_interp_spline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (ConfusionMatrixDisplay, accuracy_score,
                             classification_report, confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from wordcloud import WordCloud

import nltk
# One-time NLTK resource downloads (no-ops if already present)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

Step 1: DATA ACQUISITION¶

In [4]:
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1AN2n1n0UyhVOb-9AFA7dSnJQPuYwl-Sb")

#DataSet
In [5]:
df = pd.DataFrame(df)

#Dataframe
In [6]:
type(df)

#Dataframetype
Out[6]:
pandas.core.frame.DataFrame

Step 2: DATA VISUALIZATION¶

Visuals of raw data¶

PIE CHART (DONUT)¶

In [7]:
sentiment_counts = df['Sentiment'].value_counts()


# Derive labels from the value_counts order so the wedges cannot be mislabelled
labels = ['Positive' if s == 1 else 'Negative' for s in sentiment_counts.index]
colors = ['#4CAF50', '#FF5252']


explode = (0.1, 0) 

plt.figure(figsize=(8, 8))
plt.pie(sentiment_counts, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', startangle=140, wedgeprops=dict(edgecolor='k'))


centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)


plt.title('Sentiment Distribution')
plt.axis('equal')
plt.show()



# Each sentiment is correctly counted and the wedge labels are derived from the counts

There are no visible imbalances in the dataset; the two classes appear roughly balanced, so the data seems unbiased toward either sentiment¶
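The balance claim can also be checked numerically rather than by eye. A minimal sketch, using a hypothetical miniature of the Sentiment column (the real check would run on `df['Sentiment']`):

```python
import pandas as pd

# Hypothetical stand-in for df['Sentiment']; the real column comes from the Kaggle CSV.
sentiments = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])

# value_counts(normalize=True) gives class proportions directly,
# which is a more precise balance check than reading the donut chart.
proportions = sentiments.value_counts(normalize=True)
print(proportions)

# One common rule of thumb: flag imbalance if the minority class falls below ~40%.
imbalanced = proportions.min() < 0.40
print("Imbalanced:", imbalanced)
```

With a roughly balanced dataset like this one, accuracy remains a reasonable headline metric; under imbalance, macro-averaged F1 would be a safer choice.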

HISTOGRAM¶

In [8]:
sns.histplot(df['Sentiment'], kde=True)
Out[8]:
<Axes: xlabel='Sentiment', ylabel='Count'>
In [9]:
df.info()

#checking datatype
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108751 entries, 0 to 108750
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  108751 non-null  int64 
 1   Sentiment   108751 non-null  int64 
 2   Sentence    108750 non-null  object
dtypes: int64(2), object(1)
memory usage: 2.5+ MB

Step 3: TEXT CLEANING & PREPROCESSING¶

To keep training and evaluation of the models fast while iterating, I work with a sample of the data.¶

In [10]:
df = df.head(5000).copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
df.head()
Out[10]:
Unnamed: 0 Sentiment Sentence
0 0 0 According to Gran , the company has no plans t...
1 1 1 For the last quarter of 2010 , Componenta 's n...
2 2 1 In the third quarter of 2010 , net sales incre...
3 3 1 Operating profit rose to EUR 13.1 mn from EUR ...
4 4 1 Operating profit totalled EUR 21.1 mn , up fro...

As can be seen below, the sampled rows contain no null values, which is a plus¶

In [11]:
df.isnull().sum()

#checking for null values
Out[11]:
Unnamed: 0    0
Sentiment     0
Sentence      0
dtype: int64
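Note that df.info() on the full dataset showed one null Sentence (108,750 non-null out of 108,751 rows); the 5,000-row sample happens to contain none. A defensive dropna before text cleaning would guard against that case. A sketch with a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame standing in for the Kaggle data; note the missing Sentence.
sample_df = pd.DataFrame({
    'Sentiment': [1, 0, 1],
    'Sentence': ['Profit rose', None, 'Sales fell'],
})

# Drop rows whose Sentence is null before any text cleaning, so the
# preprocessing function never receives a non-string value.
sample_df = sample_df.dropna(subset=['Sentence']).reset_index(drop=True)
print(len(sample_df))
```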

I have streamlined the text in the 'Sentence' column to remove noise. The preprocessing steps include stripping HTML, expanding contractions, lowercasing, removing punctuation, digits, links, and special characters, trimming extra whitespace, tokenizing, removing stopwords, and lemmatizing, so that feature extraction works on plain, relevant text¶

In [12]:
def text_unclutter(text):
    # Guard against nulls/non-strings so the pipeline never crashes mid-apply
    if not isinstance(text, str):
        return ""

    #1. Strip HTML markup
    text = BeautifulSoup(text, "html.parser").get_text()

    #2. Expand contractions (e.g. "don't" -> "do not")
    text = contractions.fix(text)

    #3. Convert to lowercase
    text = text.lower()

    #4. Remove URLs, mentions, and hashtags (before punctuation is stripped)
    text = re.sub(r"http\S+|www\.\S+|@\w+|#\w+", "", text)

    #5. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    #6. Remove numbers/digits
    text = re.sub(r'\d+', '', text)

    #7. Keep letters and whitespace only (drops remaining special characters)
    text = re.sub(r'[^a-z\s]', '', text)

    #8. Collapse extra whitespace
    text = ' '.join(text.split())

    #9. Tokenize the text
    tokens = word_tokenize(text)

    #10. Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    #11. Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    #12. Join tokens back to a single string
    return ' '.join(tokens)
In [13]:
df['Cleaned_Sentence'] = df['Sentence'].apply(text_unclutter)
In [14]:
# Sort the Sentences by their Sentiment 
positive_text = ' '.join(df[df['Sentiment'] == 1]['Cleaned_Sentence'].tolist())
negative_text = ' '.join(df[df['Sentiment'] == 0]['Cleaned_Sentence'].tolist())

WORD CLOUDS for each sentiment category¶

In [15]:
wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_text)


image_positive = wordcloud_positive.to_image()
image_negative = wordcloud_negative.to_image()

fig_positive = go.Figure(go.Image(z=image_positive))
fig_positive.update_layout(
    title_text='Positive Sentiment Word Cloud',
    width=1000,  
    height=600   
)

fig_negative = go.Figure(go.Image(z=image_negative))
fig_negative.update_layout(
    title_text='Negative Sentiment Word Cloud',
    width=1000,  
    height=600   
)

fig_positive.show()
fig_negative.show()

FEATURE ENGINEERING¶

Bag of Words¶

Analysing words from the 'Sentence' column involves identifying their class, so I use Bag of Words (BoW) to map each word to its count per sentence¶

In [16]:
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(df['Cleaned_Sentence'])
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

TF-IDF¶

Market sentiment is worth studying closely: even a mild influence from the news can have a major impact in ever-changing market conditions. I therefore use TF-IDF to measure the importance of each word relative to the rest of the corpus, vectorizing the text into numeric vectors that capture how informative each word is in context¶

In [17]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Cleaned_Sentence'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

"Let's explore the Machine Learning Models"¶

NAIVE BAYES CLASSIFICATION¶

In [18]:
X = tfidf_matrix
y = df['Sentiment']
# Data split X,Y
In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
## training and test sets
In [20]:
# Train Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
Out[20]:
MultinomialNB()
In [21]:
# Predictions on test data
y_pred = nb_classifier.predict(X_test)
In [22]:
# Evaluate model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 0.772

Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.98      0.85       683
           1       0.89      0.32      0.47       317

    accuracy                           0.77      1000
   macro avg       0.82      0.65      0.66      1000
weighted avg       0.80      0.77      0.73      1000

In [23]:
#Visualizing Model Results
cmap = sns.light_palette("purple", as_cmap=True)
t1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
t1.plot(cmap=cmap, values_format='d', ax=plt.gca())
plt.show()

SUPPORT VECTOR MACHINES¶

In [24]:
# Train SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
Out[24]:
SVC(kernel='linear')
In [25]:
# Predictions
y_pred = svm_classifier.predict(X_test)
In [26]:
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Accuracy: 0.86

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.97      0.90       683
           1       0.90      0.61      0.73       317

    accuracy                           0.86      1000
   macro avg       0.87      0.79      0.82      1000
weighted avg       0.86      0.86      0.85      1000

In [27]:
#Visualizing Model Results

cmap = sns.light_palette((210, 90, 60), input="husl", as_cmap=True)
t1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
t1.plot(cmap=cmap, values_format='d', ax=plt.gca())
plt.show()

Logistic Regression¶

In [28]:
# Train LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
Out[28]:
LogisticRegression()
In [29]:
# Predictions
y_pred = model.predict(X_test)
In [30]:
print('Classification Report:\n', classification_report(y_test, y_pred))
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.99      0.88       683
           1       0.94      0.47      0.62       317

    accuracy                           0.82      1000
   macro avg       0.87      0.73      0.75      1000
weighted avg       0.84      0.82      0.80      1000

In [31]:
#Visualizing Model Results
cmap = sns.light_palette("lightcoral", as_cmap=True)
t1 = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
t1.plot(cmap=cmap, values_format='d', ax=plt.gca())
plt.show()
In [8]:
data = {
    'Model': ['Logistic Regression', 'Support Vector Machines', 'Naive Bayes'],
    'Precision': [0.80, 0.84, 0.76],
    'Recall': [0.99, 0.97, 0.98],
    'F1-Score': [0.88, 0.90, 0.85],
    'Accuracy': [0.82, 0.86, 0.77]
}

metrics_df = pd.DataFrame(data)  # new name so the dataset df is not clobbered
df_melted = metrics_df.melt(id_vars='Model', var_name='Metric', value_name='Score')
plt.figure(figsize=(12, 8))

metrics = ['Precision', 'Recall', 'F1-Score', 'Accuracy']
metric_indices = np.arange(len(metrics))

for model in metrics_df['Model']:
    model_data = metrics_df[metrics_df['Model'] == model].iloc[0, 1:].values.astype(float)
    # Interpolate data points for smooth curves
    x_smooth = np.linspace(metric_indices.min(), metric_indices.max(), 300)
    spline = make_interp_spline(metric_indices, model_data, k=3)  # Cubic spline
    y_smooth = spline(x_smooth)
    plt.plot(x_smooth, y_smooth, label=model, linewidth=2)

sns.scatterplot(data=df_melted, x='Metric', y='Score', hue='Model', s=100, palette='viridis')

# Add title and labels
plt.title('Comparison of Classification Metrics - Curvy Line Chart', fontsize=16)
plt.xlabel('Metric', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.legend(title='Model', title_fontsize='13', loc='lower right')

plt.show()

MODEL COMPARISON & SELECTION: Comparatively, SVM appears to be the best model for this stock sentiment analysis, as it performs well across most metrics relative to Naive Bayes and Logistic Regression, even though Logistic Regression and Naive Bayes each score well on individual metrics such as precision and recall.¶
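A single train/test split can flip such a ranking by chance. A more robust comparison would cross-validate each model on macro-F1, which weighs both classes equally. The sketch below uses synthetic data as a stand-in; the real comparison would pass the TF-IDF matrix and labels from above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features; MultinomialNB requires
# non-negative inputs, so the data is shifted to be >= 0.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X = X - X.min()

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM (linear)": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Macro-F1 over 5 folds treats both classes equally, so a model cannot look
# strong merely by over-predicting the majority class.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```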

DEPLOYMENT, MONITORING & MODEL UPDATING¶

Once the model is trained, newly collected data from news articles and other sources such as social media (expert opinions, blogs, free news feeds) can be preprocessed in the same way (tokenization, etc.) and passed through the NLP pipeline to classify sentiment and compute a sentiment score.¶

With this model, market sentiment signals can be monitored against stock market trends on a regular basis, tracking fluctuations in investing behaviour, public opinion, and sentiment towards specific stocks and sectors, all of which contribute to stock prices.¶

The model's insights can then support timely, informed decisions on investment strategies (Recommend, Buy, Sell, Hold).¶

Extensive analysis and actionable recommendations help mitigate stock-specific risks, and above all a feedback loop enables iterative data updates and re-analysis.¶
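Scoring newly collected headlines with the trained pipeline could look like the sketch below. It uses stand-in training data, but the key point carries over: new text must go through the same fitted vectorizer (transform, never fit_transform) before prediction. In the notebook, tfidf_vectorizer and svm_classifier already play these roles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Stand-in training data; in the notebook these are the cleaned sentences and labels.
train_texts = ["operating profit rose", "net sales increased",
               "company cut jobs", "plant closed losses grew"]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
classifier = SVC(kernel="linear")
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# New, unseen headlines pass through the SAME fitted vectorizer before prediction;
# words never seen in training are simply ignored by transform().
new_headlines = ["profit rose sharply", "jobs cut at plant"]
predictions = classifier.predict(vectorizer.transform(new_headlines))
print(dict(zip(new_headlines, predictions)))
```

In production, the new headlines would first run through text_unclutter so that training and inference see identically cleaned text.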

Picture1- flow chart.png

Challenges: One difficulty with sentiment analysis is measuring mixed sentiments. Bad news is sometimes good news for certain stocks: an earthquake or other disaster will directly drive down insurance stocks, while home improvement retailers may see an opportunity for rising sales and a corresponding price rise. Moreover, whenever expert or analyst opinions are shared in news articles, there are often many disputing views against their claims. Someone may feel positively about a stock after getting insights from the company whose shares they own, yet negatively about the market as a whole. This type of issue requires careful modelling of sentiment.¶

Moreover, if the model is updated with social media posts and blogs, many market commentators and talking heads on social media tend to have inflated sentiment: they are most likely to post extremely good or bad opinions about a specific company or stock. There is also arguably a strong self-selecting personality bias among the kind of people who frequently post and interact publicly on social media. Even controlling for language and isolating English, only a minority of English speakers use social media at all. Bot-generated content on social media is yet another problem: doing sentiment analysis at scale, you won't necessarily know what came from a human. News articles share some of the same issues, so slowness in reporting breaking news, credibility, potential bias, and tangibility are all highly subjective matters that need careful scrutiny when analysing sentiment.¶

In [ ]: